This report looks at exploring the relationship between wastewater and cases. There are four components to this analysis.

  1. Removing putative outliers

  2. Binning analysis

  3. Smoothing signal

  4. Statistical analysis

This report does not present any final answers but presents some very convincing heuristics.

1 Data: The first look

The two data sets used in this analysis are the Madison case data sourced from the Wisconsin DHS and wastewater concentration data produced by the Wisconsin State Laboratory of Hygiene. This wastewater data has entries every couple of days from November 15 2020 to June 24 2021.

Date Site Cases SevenDayMACases N1 BCoV N2 PMMoV GeoMeanN12 FlowRate temperature equiv_sewage_amt
2021-01-25 Madison 53 99.85714 135003 11.44000 233165 31732106 177420.3 37.28 NA 1
2021-01-26 Madison 60 98.85714 1831813 6.93000 1686654 29748189 1757735.7 37.74 NA 1
2021-01-27 Madison 177 97.71429 1357329 2.05900 1323010 30102158 1340059.6 36.17 NA 1
2021-01-28 Madison 111 94.28571 231625 2.39800 265869 41230863 248157.0 37.07 NA 1
2021-01-31 Madison 57 95.42857 256082 3.75643 233472 26460001 244515.8 36.30 NA 1
2021-02-01 Madison 46 94.14286 149707 1.86200 108584 27005372 127498.2 36.28 NA 1

The case data has a strong weekend effect so for this section we look at a seven day smoothing of cases. The simple display of the data shows the core components of this story. First that wastewater data is very noisy. And that there is a hint of a relationship between the two signals.

FirstImpressionDF <- FullDF%>%
  NoNa(SelectedIndVar,"Cases")

FirstImpression <- FirstImpressionDF%>%#Removing outliers
  ggplot(aes(x=Date))+#Data depends on time
  geom_point(aes(y=(Cases), color="Cases",info=Cases),size = 1)+
  geom_line(aes(y=MinMaxFixing(!!sym(SelectedIndVar),Cases), 
                color=SelectedIndVar,
                info=!!sym(SelectedIndVar)))+#compares SelectedIndVar to Cases
  geom_line(aes(y=(SevenDayMACases), 
                color="Seven Day MA Cases",
                info=Cases))+
  labs(y="Reported cases")+
  ColorRule


ggplotly(FirstImpression)%>%
    add_lines(x=~Date, y=FirstImpressionDF[[SelectedIndVar]],
            yaxis="y2", data=FirstImpressionDF, showlegend=FALSE, inherit=FALSE) %>%
    layout(yaxis2 = SecondAxisFormat,
           legend=list(title=list(text=''),x = 1.15, y = 0.9),
           margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))
#To remoive weekend effects we are looking at the 7 day smoothing of cases.

2 Removing potential outliers

Looking at the wastewater measurements we observe minimal movement in N1 until it spikes up to expected levels. This is a sign that N1 before 11/20/2020 might have failed to capture the true amount of Covid shed in the community. To get the best picture between N1 and cases we removed these points from the analysis. In addition there were some points many times larger than adjacent values hinting at them being outliers. For the same reason we removed these points from the analysis.

DaySmoothed=21#Very wide smoothing to find where the data strong deviate from trend
ErrorMarkedDF <- FullDF%>%#Remove older Var data that clearly has no relationship to Cases
    mutate(FlagedOutliers = IdentifyOutliers(!!sym(SelectedIndVar), Action = "Flag"),
           ReplacedOutliers = ifelse(FlagedOutliers,NA,!!sym(SelectedIndVar)),
           ReplacedOutliers = ifelse(Date == mdy("01/26/2021") | Date == mdy("01/27/2021"), NA, ReplacedOutliers))


#Calculating error Limits
#EarlyOutlier | LargeError | LargeError2| LargeError3
ErrorRemovedDF <- ErrorMarkedDF%>%
  mutate(!!OutlierName := ifelse(FlagedOutliers,!!sym(SelectedIndVar),NA))%>%
  mutate(!!SelectedIndVar := ReplacedOutliers)

#!is.na(Cases),
#SevenDayMACases
Ref <- ErrorRemovedDF%>%
  filter(!(is.na(!!sym(SelectedIndVar))&is.na(!!sym(OutlierName))))%>%
  ggplot(aes(x=Date))+#Data depends on time
  geom_line(aes(y=!!sym(SelectedIndVar),
                color=SelectedIndVar, 
                info = !!sym(SelectedIndVar)))+#compares Var to Cases
  geom_point(aes(y=!!sym(OutlierName),
                 color=OutlierName,
                 info = !!sym(OutlierName)))+
  ColorRule


ggplotly(Ref,tooltip=c("info","Date"))%>%
    layout(yaxis = SecondAxisFormat,
           legend=list(title=list(text=''),x = 1.15, y = 0.9),
           margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))
UpdatedDF <- ErrorRemovedDF

3 Binning

To isolate this relationship we used a primitive binning relationship. We used non overlapping bins of 2 weeks and took the median of the data within that range. This reduces autocorrelation issues and masks potential noise in the data. We see a very strong trend once the potential outliers are removed.

#StartDate is Where the binning starts
#DaySmoothing is The size of the bins
#Lag is The offset between Cases and Var
BinnedVarName <- paste("Binned",SelectedIndVar)
Bining <- function(DF,StartDate=1,DaySmoothing=14,Lag=0){
  BinDF <- DF%>%
    select(Date, Cases, !!sym(SelectedIndVar))%>%
    mutate("Moved Cases" = data.table::shift(Cases, Lag),#moving  SLD lag days backwards
        Week=(as.numeric(Date)+StartDate)%/%DaySmoothing)%>%#putting variables into bins via integer division
    group_by(Week)%>%
  #filter(Week>2670)%>%
    summarise("Binned Cases" := median(!!sym("Moved Cases"), na.rm=TRUE), 
              !!BinnedVarName := (median((!!sym(SelectedIndVar)),
        na.rm=TRUE)), 
        Date = mean(Date,na.rm = TRUE))#summarize data within bins.
  return(BinDF)
}

BinErrorRemovedDF <- Bining(UpdatedDF)
BinErrorKeptDF <- Bining(FullDF)

DiffrenceDF <- inner_join(BinErrorRemovedDF,BinErrorKeptDF,by=c("Week","Date"))%>%
  filter(!!paste0(BinnedVarName,".x") != !!paste0(BinnedVarName,".y"))

BinedCorrGraph <- ggplot()+
  geom_segment(aes(x = !!sym("Binned Cases.x"), 
                   y = !!sym(paste0(BinnedVarName,".x")), 
                   xend = !!sym("Binned Cases.y"),
                   yend = !!sym(paste0(BinnedVarName,".y"))),
                data = DiffrenceDF)+
  geom_point(aes(x = !!sym("Binned Cases"), 
                 y = !!sym(BinnedVarName),
                 color = "outliers not removed",
                 info = Date),
             size = 2, 
             data = BinErrorKeptDF,
             shape=15)+
  geom_point(aes(x = !!sym("Binned Cases"), 
                 y = !!sym(BinnedVarName),
                 color = "outliers removed",
                 info = Date),
             data = BinErrorRemovedDF)+
  ggtitle(paste(BinnedVarName,",Cases removed potential outliers"))+
  #geom_abline(slope = 3000)+
  labs(x="Binned Cases",y=BinnedVarName)

ggplotly(BinedCorrGraph,tooltip=c("x","y","info"))%>%
    layout(legend=list(title=list(text=''),x = 1.15, y = 0.9),
           margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))
#cor(BinDF$BinnedN1, BinDF$BinnedCases, use="pairwise.complete.obs")
#summary(lm(BinnedCases~BinnedN1, data=BinDF))
OutputBinning <- data.frame(row.names=c("correlation"),
  WithOutliers = c(cor(BinErrorKeptDF[[BinnedVarName]], 
                       BinErrorKeptDF[["Binned Cases"]],
                       use="pairwise.complete.obs")),
  
  WithOutOutliers = c(cor(BinErrorRemovedDF[[BinnedVarName]],
                          BinErrorRemovedDF[["Binned Cases"]],
                          use="pairwise.complete.obs")))
formattable(OutputBinning)
WithOutliers WithOutOutliers
correlation 0.9040724 0.9130309

4 Data smoothing

The goal in this section is to smooth the data to get a similar effect without losing resolution.

4.1 Case smoothing

A key component to this is that the relationship between N1 and Case involves a gamma distribution modeling both the time between catching Covid-19 and getting a test and the concentration of the shedded particles. We found a gamma distribution with mean 11.73 days and a standard deviation of 7.68 gives good results and match’s other research (Fernandez-Cassi et al. 2021).

Mean <- 11.73
StandardDeviation <- 7.68
Scale = StandardDeviation^2/Mean
Shape = Mean/Scale
SLDWidth <- 21

weights <- dgamma(1:SLDWidth, scale = Scale, shape = Shape)
par(mar=c(4,4,4,10))
plot(weights,  
        main=paste("Gamma Distribution with mean =",Mean, "days,and SD =",StandardDeviation), 
            ylab = "Weight", 
            xlab = "Lag")

SLDSmoothedDF <- UpdatedDF%>%
  mutate(
    SLDCases = c(rep(NA,SLDWidth-1),#elimination of starting values not relevant as we have a 50+ day buffer of case data
                        rollapply(Cases,width=SLDWidth,FUN=weighted.mean,
                                  w=weights,
                                  na.rm = FALSE)
                 #,rep(NA,10)
                 ))#no missing data to remove

#CasesMinMaxPrep = MinMaxNormalization(SLDSmoothedDF$SLDCases,Prep=TRUE)
SLDPlot = SLDSmoothedDF%>%
  NoNa("SLDCases")%>%#same plot as earlier but with the SLD smoothing
  ggplot(aes(x=Date))+
  geom_line(aes(y=Cases, 
                color="Cases" , info = Cases),alpha=.2)+
  geom_line(aes(y=SevenDayMACases,
                color="Seven Day MA Cases" , info = SevenDayMACases),alpha=.4)+
  geom_line(aes(y=SLDCases, color="SLD Cases",info = SLDCases))+
  labs(y="Reported Cases")+
  ColorRule

ggplotly(SLDPlot,tooltip=c("info","Date"))%>%
  layout(legend=list(title=list(text=''),x = 1.15, y = 0.9),
         margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))

4.2 viral load smoothing

To get a good smoothing of the N1 measurement we employ loess smoothing. Loess smoothing takes a locally weighted sliding window using some number of points. we found the best smoothing when it uses data within approximately 3 weeks of both sides of the data. The displayed plot shows the visual power of this smoothing. We see in general the smoothing trails SLD. However loess is symmetric meaning that it can not be used in predictive modeling due to it using points from the future to smooth points.

SpanConstant = .165

SLDSmoothedDF[[loessVar]] <- loessFit(y=(SLDSmoothedDF[[SelectedIndVar]]), 
                      x=SLDSmoothedDF$Date, #create loess fit of the data
                      span=SpanConstant, #span of .2 seems to give the best result, not rigorously chosen
                      iterations=2)$fitted#2 iterations remove some bad patterns


SLDLoessGraphic <- SLDSmoothedDF%>%
  NoNa(loessVar,"SLDCases")%>%
  ggplot(aes(x=Date))+
  geom_line(aes(y=Cases, color="Cases" , info = Cases),alpha=.1)+
  geom_line(aes(y=MinMaxFixing(!!sym(SelectedIndVar),Cases),
                color=SelectedIndVar,
                info = !!sym(SelectedIndVar)),
            alpha=.2)+
  geom_line(aes(y=SevenDayMACases,
                color="Seven Day MA Cases" , 
                info = SevenDayMACases),
            alpha=.3)+
  geom_line(aes(y=MinMaxFixing(!!sym(loessVar),Cases,!!sym(SelectedIndVar)), 
                color=loessVar ,
                info = !!sym(loessVar)))+
  geom_line(aes(y=SLDCases,
                color="SLD Cases" ,
                info = SLDCases))+
  labs(y="Reported cases")+
  ColorRule


ggplotly(SLDLoessGraphic,tooltip=c("info","Date"))%>%
    add_lines(x=~Date, y=SLDSmoothedDF[[SelectedIndVar]],
            yaxis="y2", data=SLDSmoothedDF, showlegend=FALSE, inherit=FALSE) %>%
    layout(yaxis2 = SecondAxisFormat,
           legend=list(title=list(text=''),x = 1.15, y = 0.9),
           margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))
RelationShipDF <- SLDSmoothedDF%>%
  select(Date, Cases, SevenDayMACases, SLDCases, N1, N2, BCoV, PMMoV, loessN1, FlowRate, DetectedOutlier, temperature, equiv_sewage_amt)

RelationShipDF%>%
  filter(!is.na(DetectedOutlier))


DiffDF <- RelationShipDF%>%
  mutate(DiffLoessN1 = c(NA,diff(loessN1)),
         DiffSLDCases = c(NA,diff(SLDCases)))%>%
  select(Date,DiffLoessN1,DiffSLDCases)

DiffDF%>%
  NoNa("DiffLoessN1")%>%
  ggplot()+
  aes(x=Date)+
  geom_line(aes(y=DiffSLDCases,color="DiffSLDCases"))+
  geom_line(aes(y=MinMaxFixing(DiffLoessN1,DiffSLDCases),color="DiffLoessN1"))

x <- ccf(RelationShipDF$loessN1,RelationShipDF$SLDCases,na.action=na.pass)
max(x$acf)

5 Towards a formal analysis

Cross correlation and Granger Causality are key components to formalize this analysis. Cross correlation looks at the correlation at a range of time shifts and Granger analysis performs a test for predictive power. These results must be kept in mind that the data removal was not done rigorously and the resulting data set has aspects of being non stationary. Therefore these statistics are good to keep in mind they should not be used for concrete results.

CCFChar <- function(ccfObject){
  LargestC = max(ccfObject$acf)
  Lag = which.max(ccfObject$acf)-21
  return(c(LargestC,Lag))
}

ModelTesting <- function(DF,Var1,Var2){
  UsedDF <- DF%>%
    NoNa(Var1,Var2)#removing rows from before both series started
  
  
  Vec1 <- unname(unlist(UsedDF[Var1]))
  Vec2 <- unname(unlist(UsedDF[Var2]))

  CCFReport <- CCFChar(ccf(Vec1,Vec2,na.action=na.pass,plot = FALSE))

  VarPredCase <- grangertest(Vec1, Vec2, order = 1)$"Pr(>F)"[2]
  CasePredVar <- grangertest(Vec2,Vec1, order = 1)$"Pr(>F)"[2]
  return(round(c(CCFReport,CasePredVar,VarPredCase),4))
}

#ErrorRemovedDF
BaseLine <- ModelTesting(FullDF,SelectedIndVar,"Cases")
BaseLineSevenDay <- ModelTesting(FullDF,SelectedIndVar,"SevenDayMACases")
ErrorRemoved <- ModelTesting(UpdatedDF,SelectedIndVar,"SevenDayMACases")
SLDVar <- ModelTesting(SLDSmoothedDF,SelectedIndVar,"SLDCases")
SevenLoess <- ModelTesting(SLDSmoothedDF,loessVar,"SevenDayMACases")
SLDLoess <- ModelTesting(SLDSmoothedDF,loessVar,"SLDCases")


Output <- data.frame(row.names=c("Max Cross Correlation","Lag of largest Cross correlation","P-value WasteWater predicts Cases","P-value Cases predicts wastewater"),
  CasesvsVar = BaseLine,
  SevenDayMACasesvsVar = BaseLineSevenDay,
  ErrorRemoved = ErrorRemoved,
  SLDVar = SLDVar,
  SevenLoess = SevenLoess,
  SLDLoess = SLDLoess)

OutputRightPosition <- transpose(Output)
colnames(OutputRightPosition) <- rownames(Output)
rownames(OutputRightPosition) <- c(paste("Section 1: Cases vs" , SelectedIndVar),
                                   paste("Section 1: 7 Day MA Cases vs" , SelectedIndVar),
                                   paste("Section 2: Cases vs" , SelectedIndVar),
                                   paste(" Section 4.1: SLD Cases vs ",SelectedIndVar),
                                   paste("Section 4.2: 7 Day MA Cases vs Loess smoothing of ",SelectedIndVar),
                                   paste("Section 4.2: SLD Cases vs Loess smoothing of ",SelectedIndVar))

formattable(OutputRightPosition)
Max Cross Correlation Lag of largest Cross correlation P-value WasteWater predicts Cases P-value Cases predicts wastewater
Section 1: Cases vs N1 0.6137 1 0.0000 0.0000
Section 1: 7 Day MA Cases vs N1 0.7066 1 0.0000 0.0002
Section 2: Cases vs N1 0.8275 1 0.0000 0.0000
Section 4.1: SLD Cases vs N1 0.8107 1 0.0000 0.1902
Section 4.2: 7 Day MA Cases vs Loess smoothing of N1 0.9263 1 0.0113 0.0000
Section 4.2: SLD Cases vs Loess smoothing of N1 0.8834 1 0.0000 0.0000
Fernandez-Cassi, Xavier, Andreas Scheidegger, Carola Bänziger, Federica Cariti, Alex Tuñas Corzon, Pravin Ganesanandamoorthy, Joseph C. Lemaitre, Christoph Ort, Timothy R. Julian, and Tamar Kohn. 2021. “Wastewater Monitoring Outperforms Case Numbers as a Tool to Track COVID-19 Incidence Dynamics When Test Positivity Rates Are High.” Water Research 200: 117252. https://doi.org/10.1016/j.watres.2021.117252.